FILENAME: FREEGRAM4.TXT AUTHOR: Jim Roskind Independent Consultant 516 Latania Palm Drive Indialantic FL 32903 (407)729-4348 jar@ileaf.com or ...uunet!leafusa!jar 6/6/90 Dear C++ and C Grammar User, I have written a YACC debugging tool, and a set of grammars for C and C++ in order to use them within my own personal project development. I have made my work in this area available to other developers at no charge with the hope that they would use my work. I believe the entire C++ community can benefit from such standardization. If any of the copyright notices on the grammars (which are VERY liberal) prevent using my work, please notify me of the problem. Note that the grammars can each be processed by YACC, but they are very clean, and make NO USE of the precedence setting (i.e.: %prec) or associativity setting (i.e.:%assoc) constructs of YACC. This feature should make them easily portable to other parser generator input format. This "cleanliness" fact also provides brutal exposure of all the complex constructs in C++, and the complexity of the grammar as a whole (the C++ grammar is 2 to 3 times as large as the C grammar) reflects this exposure. The files included in this set are: FREEGRM4.TXT This introductory file GRAMMAR4.TXT Parsing ambiguities in C++, and in my grammar CPP4.Y My YACC compatible C++ grammar C4.Y My YACC compatible, ANSI C conformant grammar CPP4.L Flex input file defining a C++ lexical analyser SKELGRPH.C A hacked file from the Berkeley YACC distribution. Aside from the addition of several files, this release of my grammar corrects a few problems located in my prior release. This release will hopefully conclude compliance with C++ 2.0, and allow me to release a C++ 2.1 compliant grammar (re: nested types). Since my first public release of my grammar, I have received a number of requests. One of the most common requests was for a lexical analyser to go with the grammar. This release of the grammar provides such a a "bare bones" lexical analyser. The analyser does not support preprocessing, or even comment removal. In addition, since I have not included a symbol table, or semantic actions in the grammar to maintain proper context (i.e., current scope), typedef names and struct/class/union/enum tags are not *really* defined. To allow users to experiment with my grammar without a symbol table, my lexer assumes that if the first letter of the name is upper case, then then name is a type name. This hack is far from sufficient for parsing full blown programs, but it is more than sufficient for experimenting with the grammar to determine the acceptability of a token sequence, and to understand how my grammar parsed the sequence. Since I did not believe that a lexical analyser alone would be sufficient to assist many people with playing with my grammar, I have also provided the basis for a tool to explain what a grammar is doing. Specifically, I have modified a file that is included in the Berkeley YACC distribution so that parsers generated by such a YACC would automatically display a syntax tree in graphical-ASCII format during a parse. The instructions for using and building this yacc tool are presented in the next section. Note that there are no significant special hooks in my grammar or parser to excite this yacc tool, and the tool can be used equally well on any grammar that you are working with. I have posted all 6 files to comp.lang.c++, to make this information as available as possible to users and developers. I will also post this introductory note to comp.compilers, and comp.lang.c. I am arranging for archival support via several ftp sites, and updates will be posted to those sites. I will also try to get the source to Berkeley YACC posted to these ftp sites, although it is certainly available at more central sites. HOW TO USE THE DISTRIBUTION FILES TO PLAY WITH THE C++ GRAMMAR Note that the following instructions assume that you have the Berkeley YACC source on hand. You can actually use any YACC to process the grammar, but running the resulting demo (which has no semantic actions) will tend to be quite boring. If you can't get hold of the Berkeley YACC, you should at least try to enable the "debugging options" in your parser to so that you can see in some way what reductions are taking place. (Hint: search for the letters "debug" in the C file that your yacc produces...). 1) Get the entire source for Berkeley YACC 1.4 2/25/90 2) Verify that you can make the YACC 3) Rename SKELETON.C to SKELOLD.C, and implant my SKELGRPH.C in that directory as SKELETON.C 4) Make the yacc using this new SKELETON.C 5) Using the above yacc, process my CPP4.Y file yacc -dvl cpp4.y The result should be a file y.tab.c, and y.tab.h 6) Using Flex (replacement for lex) process my CPP4.L file flex cpp4.l the result should be yy.lex.c 7) Compile the two files cc -o cpp4 y.tab.c yy.lex.c the result should be an executable called cpp4 8) Set the environment variable YYDEBUG to 6 setenv YYDEBUG 6 If you don't do this, the graphical output will not appear! 9) Run the program cpp4 cpp4 10) Try the input: int a; 11) You should see a nice parse tree. Enjoy. Note that the lexer DOES NOT INCLUDE A SYMBOL TABLE, and does NOT KEEP TRACK OF CURRENT SCOPES. The hack (see the CPP4.L file for details) is to assume that any identifier that begins with a capital letter is a typedef name Send complaints about code that doesn't parse "correctly". INTRODUCTION TO 3/3/90 Version of my release In light of the recent debate on comp.lang.c++ about the "impossibility" of generating a YACC grammar for C++ (that doesn't also accept every possible string of tokens), providing these grammars at this point is especially nice. Rather than addressing each of the ad hoc arguments that such an endeavor is impossible (the least convincing of which said effectively: "... heard it was impossible, ... something about LALR ...", and the most correct of which said: "...it is not possible to create a simple YACC grammar for C++.."), I am offering my grammars as my rather complete commentary on the topic. (I'll probably get some flaming due to this comment, but I had to read all the stuff that was posted, that told me what I had done was impossible. ;-) ) My general feeling is that it is a shame for opinions to be confused with facts, and for impossibility to be confused with difficulty or complexity. Developing the C grammar (that is intended to be compatible with the ANSI C standard) was relatively straight forward (compared to the C++ grammar). The one difficulty in this process was the desire to avoid use of %prec and %assoc constructs in YACC, which would tend to obscure ambiguities. Since I didn't know what ambiguities were lying in wait in C++, obscuring ambiguities was unacceptable. It took several weeks to remove the conflicts that typically appear, and the tedious process exposed several ambiguities that are not currently disambiguated by the ANSI standard. The quality of the C grammar is (IMHO) dramatically higher than what has been made available within the public domain. Specifically, a C grammar's support of redefinition of typedef names within inner scopes (the most difficult area of the grammar) is typically excluded from public domain grammar, and even excluded from most grammars that are supplied commercially with parser generators! I expect that this grammar will be very useful in the development of C related tools. The development of the C++ grammar (initially compatible with version 1.2, but enhanced to support version 2.0 specifications as they were made available) was anything but straight forward. The requirement that I set to NOT USE %prec and %assoc proved both a blessing and a curse. The blessing was that I could see what the problems were in the language, the curse was that there were A LOT of conflicts (I can recall times during the development effort when the number of conflicts was well in excess of 200). Initially I was unaware that many other attempts had been made and failed, and I went ahead "blindly" trying to resolve the conflicts in my grammar. After raising the issues that I noticed with Bjarne Stroustrup, I became aware that there were some very significant syntactical ambiguities within the current C++ language. Fortunately, by the time I first spoke to Dr. Stroustrup, I had already derived some results that other attempts had not uncovered. Encouraged by my results, I continued on despite hearing ever louder claims that my goal was "impossible". Towards the end of the development of the C++ grammar, which took roughly 3 months of my time, I began to see the folly in part of my quest. I came to the conclusion that further attempts to modify my grammar, so as to defer resolutions of ambiguities, would lead to an unreadable language. Specifically, my feeling was that I was entering into a battle of wits with the compiler, and the compiler was starting to win. It was beginning to be the case that the parser COULD figure out what I said, but I couldn't. Indeed, even examples in a version of the C++ 2.0 reference manual demonstrated this problem (my parser could parse some examples that neither I nor the authors parsed correctly!). At this point I decided to stop my quest to FURTHER defer resolutions of ambiguities, and let the grammar commit in one direction (always in favor of declarations), at the late point that is provided by my grammar. If this direction proved "incorrect in light of the context that followed", then I generated a syntax error. I believe this strategy provides ample room for expressiveness. In support of this expressiveness, I have (based on my discussions with language experts) deferred disambiguation far longer than other attempts at producing an LR(1) grammar. I would strongly argue that any code that my grammar identifies as having a "syntax error" (based on "premature" disambiguation), but cfront allows, should ABSOLUTELY be rewritten in a less ambiguous (and hence more portable) form. As a final comment, by virtue of the consistent methodology used to build my grammar, there are a number of clearly unambiguous constructs that my grammar CAN parse, and cfront 2.0 cannot (not to mention recent versions of GNU's G++ and Zortech's C++ compiler). One major motivation for using the C++ grammar that I have provided is that it is capable of supporting old style function definitions (e.g.: main(argc, argv) int argc; char*argv[]; {...} ). To my knowledge, no other C++ parser is currently capable of this feat. I believe this capability was removed from the C++ specification in order to reduce ambiguities in a specific implementation (cfront). As my grammar demonstrates, this action was not necessary. Supporting old style function definition GREATLY eases the transition to the use of C++ from traditional C. I expect that as some parsers begin to support this option, that other parsing engines will be forced in this direction by a competitive marketplace. Using my grammar, and the standards it implies, appears to be a very straightforward approach to this support. A second motivation for using my grammar is that it can be processed by YACC. The advantage in this fact lies with YACC's capability to identify ambiguities. For software manufacturers that are heavily concerned with correctness, this is an INCREDIBLE advantage. My experience with hand written parsers (which usually represent a translation by a fallible human from a grammar to parsing code) is that they evolve and become more correct with time. Ambiguous cases are often misparsed, without the author ever realizing there was a conflict! In contrast, either a YACC grammar supports a given construct, or it doesn't. If a YACC grammar supports a construct, the semantic interpretation is usually rather straight forward. The likelihood of internal errors in the parser is therefore SIGNIFICANTLY reduced. The fact the the grammars I supplied are free of %assoc and %prec operators, implies the grammar are fairly portable, and the conflicts are open to peer code review (and not obscured). If you find any errors in my grammars, I would be DELIGHTED to hear mention of them!!!! These should fall into one of the following categories: 1) The grammar left out the following features of C++... or 2) The grammar mis-parses the following sequences... or 3) The discussion of the following ambiguity is incorrect... 4) The grammar could be simplified as follows... Please send correspondence of this form to jar@ileaf.com. My response to 1's will be to add the feature (if possible!); feel sad that I made a mistake; and feel glad that YOU found it. I will have a similar response to 2's. Responses of type 3 are GREAT, but I haven't found many folks that really get into YACC ambiguities, so I have low expectations... feel free to surprise me!!! :-) :-). Items of type 3 are interesting, but since simplicity is in the eye of the beholder, such suggestions are subject to debate. I would be interested in seeing suggestions in this area with the constraint that they do not increase the number of conflicts in the grammar! Please use YACC to check your suggestions before submitting them. (It is often AMAZING how the slightest change in the grammar can change the number of conflicts, and it took a great deal of work to reach the current state). Distribution site(s) will be set up to distribute updates and or corrections. Postings about the presence of corrections will be made on the net. Since the two grammars (C and C++) were generated in parallel, you should be able to compare non-terminals directly. This will hopefully make it easier to identify the complexities involved with the C++ grammar, and "ignore" those that result from standard ANSI C. In both cases I have left the standard if-if-else ambiguity intact. In the case of ANSI C grammar, this is the only shift-reduce conflict in the grammar. Although there are a number of conflicts in the C++ grammar, there are actually very few classes of problems. In order to disambiguate the C++ grammar enough that YACC can figure out what to do, I was commonly forced to "inline expand" non-terminals found in the C grammar. This expansion allowed YACC to defer disambiguation until it was possible for an LR(1) parser to understand the context. The unfortunate consequence of this inline expansion is a large growth in the number of rules, and the presence of an effective "multiplier" in most cases where conflicts do occur. As a result, any conflicts that arise are multiplied by a factor corresponding to the number of rules I had to list separately. I have grouped the C++ grammar conflicts in the "Status" section of the GRAMMAR.TXT paper, but you are welcome to explore my grammars using YACC directly (be warned that you will need a robust version of YACC to handle the C++ sized grammar). PLEASE do not be put off by the number of conflicts in the C++ grammar. There are VERY FEW CONFLICTS, but my elaborated grammar confuses the count. The GRAMMAR4.TXT paper is FAR from a publishable quality paper, but it discusses many of the issues involved in ambiguities in my grammar, and more generally in the C++ language. I hope GRAMMAR4.TXT it is a vast improvement over "nothing at all", but apologize in advance for lack of polished structure, and the presence of many typos (which must surely be present). I hope you find this almost-paper interesting. My attempts at documenting conflicts have certainly clarified the problems in my mind. Based on my experience with the conflicts I have identified, most current compilers and translator fall prey to the situations that I have uncovered. I hope that other compilers, that do not make use of the grammar I have made available, will at least seek to standardize the resolution of the problems identified. As a small commercial message... I am a freelance consultant, and I travel far and wide to perform contracts. If you like the work that I am presenting in this group of documents, and would like to see a resume or at least talk, please feel free to contact me. I hope that the grammars that I have provided, will lead to many successful C++ processing projects. Jim Roskind Independent Consultant 516 Latania Palm Drive Indialantic FL 32903 (407)729-4348 jar@ileaf.com or ...!uunet!leafusa!jar